The primary objective of this project is to use data mining techniques to extract valuable insights from sales data. By examining patterns, trends, and relationships within the data, the project aims to identify opportunities for optimizing sales strategies and enhancing overall performance. Through this analysis, we seek to give decision-makers actionable insights that can drive business decisions.

Dataset used: https://www.kaggle.com/datasets/ahmedabbas757/dataset/data

Problem:

This data mining project analyzes sales data to uncover insights that business analysts and business domain users can use to make good decisions.

As a first step, we will clean the data.

  • These are all the imports we need
In [132]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import numpy as np
from sklearn_extra.cluster import KMedoids
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
In [133]:
df = pd.read_csv('data_sales.csv')

print("Info about the data:")
print(df.info())
Info about the data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9641 entries, 0 to 9640
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Retailer          9641 non-null   object
 1   Retailer ID       9641 non-null   int64 
 2   Invoice Date      9641 non-null   object
 3   Region            9641 non-null   object
 4   State             9641 non-null   object
 5   City              9641 non-null   object
 6   Product           9641 non-null   object
 7   Price per Unit    9639 non-null   object
 8   Units Sold        9641 non-null   object
 9   Total Sales       9641 non-null   object
 10  Operating Profit  9641 non-null   object
 11  Sales Method      9641 non-null   object
dtypes: int64(1), object(11)
memory usage: 904.0+ KB
None

Next, we count the null values and the duplicate rows.

In [134]:
null_count = df.isnull().sum().sum()

duplicate_count = df.duplicated().sum()

print("Number of null rows:", null_count)
print("Number of duplicate rows:", duplicate_count)
Number of null rows: 2
Number of duplicate rows: 0
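
Note that `df.isnull().sum().sum()` counts missing cells, not rows. The sketch below shows the difference on a small, made-up frame (`toy` and its values are hypothetical, not taken from the dataset); it is only meant to illustrate the three counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the sales data.
toy = pd.DataFrame({
    "Price per Unit": ["$50.00", np.nan, "$45.00"],
    "Units Sold": ["1,200", np.nan, "900"],
})

null_cells = toy.isnull().sum().sum()        # total number of missing cells
null_rows = toy.isnull().any(axis=1).sum()   # rows with at least one missing cell
nulls_per_column = toy.isnull().sum()        # missing cells in each column

print("Missing cells:", null_cells)
print("Rows with a missing value:", null_rows)
print(nulls_per_column)
```

Here one row is missing both values, so the cell count and the row count differ.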

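The next cell is meant to delete them. Note that its second assignment starts again from df, so the result of dropna() is discarded and the two null values remain; they are filled with the column mean in a later step.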

In [135]:
cleand_dataset = df.dropna()
cleand_dataset = df.drop_duplicates()

print("Info about the cleaned data:")
print(cleand_dataset.info())
df==cleand_dataset
Info about the cleaned data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9641 entries, 0 to 9640
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Retailer          9641 non-null   object
 1   Retailer ID       9641 non-null   int64 
 2   Invoice Date      9641 non-null   object
 3   Region            9641 non-null   object
 4   State             9641 non-null   object
 5   City              9641 non-null   object
 6   Product           9641 non-null   object
 7   Price per Unit    9639 non-null   object
 8   Units Sold        9641 non-null   object
 9   Total Sales       9641 non-null   object
 10  Operating Profit  9641 non-null   object
 11  Sales Method      9641 non-null   object
dtypes: int64(1), object(11)
memory usage: 904.0+ KB
None
Out[135]:
Retailer Retailer ID Invoice Date Region State City Product Price per Unit Units Sold Total Sales Operating Profit Sales Method
0 True True True True True True True True True True True True
1 True True True True True True True True True True True True
2 True True True True True True True True True True True True
3 True True True True True True True True True True True True
4 True True True True True True True True True True True True
... ... ... ... ... ... ... ... ... ... ... ... ...
9636 True True True True True True True True True True True True
9637 True True True True True True True True True True True True
9638 True True True True True True True True True True True True
9639 True True True True True True True True True True True True
9640 True True True True True True True True True True True True

9641 rows × 12 columns
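
If the intent is to drop both the null rows and the duplicates, the two calls can be chained so that the second step works on the result of the first. This is a minimal sketch on a hypothetical frame (`toy` and `cleaned` are illustrative names), not a rerun of the notebook cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one duplicated row and one row with a missing price.
toy = pd.DataFrame({
    "Retailer": ["Walmart", "Walmart", "Amazon", "Kohl's"],
    "Price per Unit": [50.0, 50.0, np.nan, 45.0],
})

# Chain the calls so drop_duplicates() sees the frame that dropna() returned.
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)

print(len(toy), "rows before,", len(cleaned), "rows after")
```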

Now we convert the column data types, because the sales columns were read as object rather than float.

In [136]:
float_columns = ['Price per Unit', 'Units Sold', 'Total Sales', 'Operating Profit']
for col in float_columns:
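    # regex=False treats '$' as a literal character rather than a regex anchor
    df[col] = df[col].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)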

df['Invoice Date'] = pd.to_datetime(df['Invoice Date'], errors='coerce')

print("Data types after conversion:")
print(df.dtypes)
Data types after conversion:
Retailer                    object
Retailer ID                  int64
Invoice Date        datetime64[ns]
Region                      object
State                       object
City                        object
Product                     object
Price per Unit             float64
Units Sold                 float64
Total Sales                float64
Operating Profit           float64
Sales Method                object
dtype: object
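
A slightly more defensive variant of this conversion is sketched below. It assumes the raw values look like "$1,250.50" and may contain unparseable entries; the `raw` series is made up for illustration. `pd.to_numeric(..., errors="coerce")` turns anything that cannot be parsed into NaN instead of raising.

```python
import pandas as pd

# Hypothetical raw currency strings, including a bad entry and a missing one.
raw = pd.Series(["$1,250.50", "$85", "n/a", None])

# regex=False makes "$" a literal character on every pandas version.
stripped = raw.str.replace("$", "", regex=False).str.replace(",", "", regex=False)

# Unparseable entries become NaN rather than raising a ValueError.
values = pd.to_numeric(stripped, errors="coerce")

print(values)
```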

Now we will handle the outliers so that they do not distort the analysis.

I will use the IQR method to handle the outliers.

Some information about IQR

When using the IQR method for outlier detection and treatment, you typically have two main options: deleting the rows that contain outliers or replacing the outliers with more typical values.

I use the replacing option: values beyond the whiskers (1.5 × IQR below Q1 or above Q3) are capped at the whisker value.
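
The replacing option can be sketched with `Series.clip`, which caps values at the two IQR fences. The helper name `clip_to_iqr_fences` and the sample numbers are assumptions made for illustration.

```python
import pandas as pd

def clip_to_iqr_fences(series: pd.Series, whisker_width: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - w*IQR, Q3 + w*IQR] at the nearest fence."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - whisker_width * iqr,
                       upper=q3 + whisker_width * iqr)

# Hypothetical unit counts with one obvious outlier (200).
units = pd.Series([10, 12, 11, 13, 12, 200])
clipped = clip_to_iqr_fences(units)

print(clipped.tolist())
```

Capping keeps every row, which matters when the other columns in the same row are still valid.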

In [137]:
plt.figure(figsize=(15,5))
sns.set(style="whitegrid")  

for i, column in enumerate(float_columns):
    plt.subplot(1, len(float_columns), i+1)  
    sns.boxplot(data=df, x=column)
    plt.title(column)  
plt.tight_layout()  
plt.show()
[Figure: box plots of Price per Unit, Units Sold, Total Sales, and Operating Profit before outlier treatment]
In [138]:
for col in float_columns:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    whisker_width = 1.5
    lower_whisker = q1 - (whisker_width * iqr)
    upper_whisker = q3 + whisker_width * iqr
    df[col] = np.where(df[col] > upper_whisker, upper_whisker, np.where(df[col] < lower_whisker, lower_whisker, df[col]))  
In [139]:
plt.figure(figsize=(15,5))
sns.set(style="whitegrid")  

for i, column in enumerate(float_columns):
    plt.subplot(1, len(float_columns), i+1)  
    sns.boxplot(data=df, x=column)
    plt.title(column)  
plt.tight_layout()  
plt.show()
[Figure: box plots of Price per Unit, Units Sold, Total Sales, and Operating Profit after outlier treatment]
In [140]:
 
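# Fix the misspelled product name
df['Product'] = df['Product'].replace("Men's aparel", "Men's Apparel")
# Drop rows with zero units sold; .copy() avoids SettingWithCopyWarning on the assignments below
df = df[df['Units Sold'] != 0].copy()
# Recompute Total Sales from price and units
df['Total Sales'] = df['Price per Unit'] * df['Units Sold']
# Round the profit margin to a whole percent, then rebuild Operating Profit from it
df['profit_percentage'] = (df['Operating Profit'] / df['Total Sales']) * 100
df['profit_percentage'] = df['profit_percentage'].astype('float').round()
df['Operating Profit'] = df['Total Sales'] * (df['profit_percentage'] / 100)
df.drop(columns=['profit_percentage'], inplace=True)
# Fill the remaining missing numeric values with the column mean
df[float_columns] = df[float_columns].fillna(df[float_columns].mean())
df.to_csv('modified_data_sales.csv', index=False)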

Replace the value "Men's aparel" in the "Product" column with "Men's Apparel".

Remove rows where "Units Sold" equals zero.

Recalculate "Total Sales" by multiplying "Price per Unit" by "Units Sold".

Calculate the profit percentage from "Operating Profit" and "Total Sales" and round it to the nearest integer.

Recalculate the operating profit from the rounded profit percentage and store the result in the "Operating Profit" column.

Fill missing values in the numerical columns with the column mean.

EDA

In [141]:
#to display first rows 
print(df.head())
#information about data
print(df.info())
# Statistical summary
print(df.describe())
        Retailer  Retailer ID Invoice Date     Region      State         City  \
0        Walmart      1128299   2021-06-17  Southeast    Florida      Orlando   
1      West Gear      1128299   2021-07-16      South  Louisiana  New Orleans   
2  Sports Direct      1197831   2021-08-25      South    Alabama   Birmingham   
3  Sports Direct      1197831   2021-08-27      South    Alabama   Birmingham   
4  Sports Direct      1197831   2021-08-21      South    Alabama   Birmingham   

                   Product  Price per Unit  Units Sold  Total Sales  \
0          Women's Apparel            85.0       218.0      18530.0   
1          Women's Apparel            85.0       163.0      13855.0   
2    Men's Street Footwear            10.0       700.0       7000.0   
3  Women's Street Footwear            15.0       575.0       8625.0   
4  Women's Street Footwear            15.0       475.0       7125.0   

   Operating Profit Sales Method  
0           1297.10       Online  
1            831.30       Online  
2           3150.00       Outlet  
3           3881.25       Outlet  
4           3206.25       Outlet  
<class 'pandas.core.frame.DataFrame'>
Index: 9637 entries, 0 to 9640
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Retailer          9637 non-null   object        
 1   Retailer ID       9637 non-null   int64         
 2   Invoice Date      9637 non-null   datetime64[ns]
 3   Region            9637 non-null   object        
 4   State             9637 non-null   object        
 5   City              9637 non-null   object        
 6   Product           9637 non-null   object        
 7   Price per Unit    9637 non-null   float64       
 8   Units Sold        9637 non-null   float64       
 9   Total Sales       9637 non-null   float64       
 10  Operating Profit  9637 non-null   float64       
 11  Sales Method      9637 non-null   object        
dtypes: datetime64[ns](1), float64(4), int64(1), object(6)
memory usage: 978.8+ KB
None
        Retailer ID                   Invoice Date  Price per Unit  \
count  9.637000e+03                           9637     9637.000000   
mean   1.173846e+06  2021-05-10 16:52:11.929023488       45.145719   
min    1.128299e+06            2020-01-01 00:00:00        7.000000   
25%    1.185732e+06            2021-02-17 00:00:00       35.000000   
50%    1.185732e+06            2021-06-04 00:00:00       45.000000   
75%    1.185732e+06            2021-09-16 00:00:00       55.000000   
max    1.197831e+06            2021-12-31 00:00:00       85.000000   
std    2.636304e+04                            NaN       14.473482   

        Units Sold   Total Sales  Operating Profit  
count  9637.000000   9637.000000       9637.000000  
mean    250.025734  12037.611520       3029.362764  
min       6.000000    160.000000          8.000000  
25%     106.000000   4068.000000        191.520000  
50%     176.000000   7805.000000        440.000000  
75%     350.000000  15750.000000       5200.000000  
max     716.000000  60860.000000      12888.000000  
std     194.848704  11495.128247       4160.986010  

Pair plot to visualize the relationships between Total Sales, Operating Profit, and Price per Unit

In [78]:
warnings.filterwarnings("ignore", category=FutureWarning)
sns.pairplot(df[['Total Sales', 'Operating Profit', 'Price per Unit']])
plt.show()
[Figure: pair plot of Total Sales, Operating Profit, and Price per Unit]

The distribution of sales is skewed to the right: lower sales figures occur more often than higher ones.

The next matrix visualizes the correlation between the numeric columns.

In [79]:
selected_columns = ['Price per Unit', 'Units Sold', 'Total Sales','Operating Profit']
new_data = df[selected_columns].copy()
In [80]:
correlation_matrix = new_data.corr()
plt.figure(figsize=(6, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
[Figure: correlation matrix heatmap of Price per Unit, Units Sold, Total Sales, and Operating Profit]

We found that Total Sales and Price per Unit are the most correlated pair, although the correlation is still low.

Next, we visualize the distribution of units sold.

In [81]:
df['Units Sold'].unique()[-50:]
df['Units Sold'] = df['Units Sold'].astype('int')
sns.histplot(data = df, x = "Units Sold", kde=True)
plt.show()
[Figure: histogram of Units Sold with KDE curve]

The most frequent number of units sold is between 100 and 200.

The distribution is skewed to the right, meaning lower unit counts occur more often than higher ones.

The next plot shows the top 20 states by number of sales records.

In [82]:
plt.figure(figsize = (15,6))
graph = sns.countplot(x = "State", data = df, order = df.State.value_counts()[:20].index, palette = "RdBu")
for container in graph.containers:
    graph.bar_label(container)
plt.xticks(rotation = 45)
plt.show()
[Figure: count plot of sales records for the top 20 states]

California has the most sales records, followed by Texas and New York.

The number of records per state varies considerably. There are many more records from California than from any other state.

In [83]:
# Select the columns used for clustering

selected_columns = ['Total Sales', 'Operating Profit', 'Price per Unit']  
data = df[selected_columns]

# Convert the data to a numpy array
data_a = np.array(data)
def optimal_k(data, max_clusters=4):
    inertias = []
    for i in range(1, max_clusters + 1):
        kmedoids = KMedoids(n_clusters=i, random_state=0).fit(data)
        inertias.append(kmedoids.inertia_)
    # Inertia usually keeps decreasing as k grows, so this minimum tends to be the largest k tried
    min_inertia = min(inertias)
    # Plot over the same range of k that was fitted above
    plt.plot(range(1, max_clusters + 1), inertias, marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('Inertia')
    plt.title('Elbow Method')
    plt.show()
    return inertias.index(min_inertia) + 1
best = optimal_k(data_a)
print(best)

# Define the number of clusters
k = best

# Perform k-medoids clustering
kmedoids = KMedoids(n_clusters=k).fit(data_a)
clusters = kmedoids.cluster_centers_
labels = kmedoids.labels_

print("Labels: ", labels, "\n")
print("Cluster Centers: ", clusters, "\n")
[Figure: elbow plot of inertia against the number of clusters]
4
Labels:  [1 0 0 ... 0 0 0] 

Cluster Centers:  [[9.0090e+03 7.2072e+02 6.3000e+01]
 [1.9500e+04 6.8250e+03 6.0000e+01]
 [3.5800e+04 1.1814e+04 5.0000e+01]
 [3.4680e+03 2.0808e+02 3.4000e+01]] 
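
Because inertia normally keeps falling as k grows, choosing the minimum tends to return the largest k that was tried. One common heuristic, swapped in here purely as an illustration and not what the notebook does, is to pick the k where the inertia curve bends most, measured by the largest second difference. The function name `elbow_k` and the sample inertias are hypothetical.

```python
import numpy as np

def elbow_k(inertias, k_start=1):
    """Return the k at which the inertia curve has its sharpest bend.

    inertias[i] is assumed to be the inertia for k = k_start + i.
    """
    inertias = np.asarray(inertias, dtype=float)
    if inertias.size < 3:
        raise ValueError("need at least three inertia values")
    # Discrete second difference: large where a steep drop is followed by a flat stretch.
    curvature = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
    # curvature[i] belongs to the middle point of each triple, i.e. k = k_start + i + 1.
    return k_start + int(np.argmax(curvature)) + 1

# Hypothetical inertias for k = 1..4: a big drop from k=1 to k=2, then little change.
example_inertias = [1000.0, 400.0, 350.0, 330.0]
print(elbow_k(example_inertias))
```

The heuristic cannot return the first or last k, so it needs a wider range of k than the one it is expected to choose from.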

In [84]:
plt.figure(figsize=(8, 6))

for j in range(k):
    cluster_points = data_a[labels == j]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], label=f'Cluster {j}')

plt.scatter(clusters[:, 0], clusters[:, 1], c='black', marker='x', s=100, label='Cluster Centers')
plt.xlabel(selected_columns[0])
plt.ylabel(selected_columns[1])
plt.title('K-medoids Clustering')
plt.legend()
plt.grid(True)
plt.show()
[Figure: scatter plot of the k-medoids clusters, Total Sales vs. Operating Profit, with cluster centers marked]

From the plot, the clusters rank by sales performance as follows:

1- cluster 2

2- cluster 1

3- cluster 0

4- cluster 3

In [ ]:
X = df[['Price per Unit', 'Total Sales']]
threshold = 1  # Set the distance threshold to determine clusters
z2 = linkage(X, method='single', metric='euclidean')
clusters = fcluster(z2, t=threshold, criterion='distance')
plt.figure(figsize=(10, 6))
plt.scatter(X['Price per Unit'], X['Total Sales'], c=clusters, cmap='viridis')
plt.xlabel('Price per Unit')
plt.ylabel('Total Sales')
plt.title('Scatter Plot of Hierarchical Clustering')
plt.show()
In [131]:
num_clusters = len(set(fcluster(z2, t=1, criterion='distance')))
print("Number of clusters:", num_clusters)

plt.figure(figsize=(10, 6))
dendrogram(z2)
plt.title('Dendrogram of Hierarchical Clustering')
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.show()
Number of clusters: 8
[Figure: dendrogram of the hierarchical clustering]
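
The hierarchical run above measures Euclidean distance on raw values, so Total Sales (in the thousands) dominates Price per Unit (in the tens). The sketch below shows one possible alternative on synthetic data: standardize the features first and ask `fcluster` for a fixed number of clusters. Ward linkage, `criterion="maxclust"`, and all the sample values are assumptions chosen for illustration, not the notebook's settings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Hypothetical data: two well-separated groups on very different scales.
price = np.concatenate([rng.normal(20, 1, 30), rng.normal(60, 1, 30)])
sales = np.concatenate([rng.normal(3000, 200, 30), rng.normal(30000, 200, 30)])
X_toy = np.column_stack([price, sales])

# Standardize so both features contribute comparably to the distance.
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Ward linkage merges the pair of clusters that least increases within-cluster variance.
Z = linkage(X_scaled, method="ward")

# Cut the tree so that at most two flat clusters remain.
toy_labels = fcluster(Z, t=2, criterion="maxclust")

print(np.unique(toy_labels, return_counts=True))
```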

We will use the 4 k-medoids clusters to measure sales performance:

1- cluster 2: excellent performance

2- cluster 1: upper-mid performance

3- cluster 0: mid performance

4- cluster 3: low performance

In [87]:
# Loop through each cluster
for j in range(k):
  # Filter data points belonging to the cluster
  cluster_data = data_a[labels == j]
  fig, axes = plt.subplots(len(selected_columns),1, figsize=(10, 12))  # Set figure size

  # Plot histograms for each feature on separate subplots
  for i, col in enumerate(selected_columns):
    axes[i].hist(cluster_data[:, i])
    axes[i].set_title(f"Distribution of {col} in Cluster {j}")
    axes[i].set_xlabel(col)
    axes[i].set_ylabel("Frequency")

  fig.suptitle(f"Distribution of Features in Cluster {j}")
  plt.tight_layout()  
  plt.show()
[Figures: histograms of Total Sales, Operating Profit, and Price per Unit for each of clusters 0 to 3]
In [88]:
max_sales_by_retailer = df.groupby('Retailer')['Total Sales'].max()
# Print the maximum sales for each retailer
print(max_sales_by_retailer)
Retailer
Amazon           53700.0
Foot Locker      60860.0
Kohl's           47250.0
Sports Direct    53700.0
Walmart          60860.0
West Gear        60860.0
Name: Total Sales, dtype: float64
In [89]:
total_sales_by_retailer = df.groupby('Retailer')['Total Sales'].sum()

# Print the total sales for each retailer
print(total_sales_by_retailer)
Retailer
Amazon           1.004179e+07
Foot Locker      2.763377e+07
Kohl's           1.333598e+07
Sports Direct    2.389743e+07
Walmart          9.721602e+06
West Gear        3.137590e+07
Name: Total Sales, dtype: float64
In [90]:
max_retailer = total_sales_by_retailer.idxmax()
# Print the retailer with the highest total sales
print(max_retailer)
West Gear

We found that the retailer with the highest total sales is West Gear (about 31.4 million).
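
The separate max, sum, and idxmax cells can also be combined into one grouped summary. The sketch below uses named aggregation on a made-up frame; the `toy` values and the column names `total` and `largest` are hypothetical.

```python
import pandas as pd

# Hypothetical sales records for three retailers.
toy = pd.DataFrame({
    "Retailer": ["West Gear", "West Gear", "Amazon", "Amazon", "Walmart"],
    "Total Sales": [100.0, 300.0, 250.0, 50.0, 200.0],
})

# One pass: total and largest single sale per retailer, best retailer first.
summary = (
    toy.groupby("Retailer")["Total Sales"]
    .agg(total="sum", largest="max")
    .sort_values("total", ascending=False)
)

print(summary)
print("Top retailer:", summary.index[0])
```

The same pattern applies to the region, season, and sales-method breakdowns that follow by changing the grouping column.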

In [91]:
max_sales_by_region = df.groupby('Region')['Total Sales'].max()
# Print the maximum sales for each region
print(max_sales_by_region)
Region
Midwest      53700.0
Northeast    50120.0
South        60860.0
Southeast    60860.0
West         60860.0
Name: Total Sales, dtype: float64
In [92]:
total_sales_by_region = df.groupby('Region')['Total Sales'].sum()

# Print the total sales for each region
print(total_sales_by_region)
Region
Midwest      1.655985e+07
Northeast    2.393010e+07
South        1.982409e+07
Southeast    2.007722e+07
West         3.561519e+07
Name: Total Sales, dtype: float64
In [93]:
max_region = total_sales_by_region.idxmax()
# Print the region with the highest total sales
print(max_region)
West

We found that the West region accounts for the most sales (about 35.6 million in total).

In [94]:
df['Month'] = df['Invoice Date'].dt.month
df['Month']
Out[94]:
0        6
1        7
2        8
3        8
4        8
        ..
9636    11
9637    10
9638    10
9639     4
9640    10
Name: Month, Length: 9637, dtype: int32
In [95]:
def find_seasons(monthNumber):
    if monthNumber in [12, 1, 2]:
        return 'Winter'
    
    elif monthNumber in [3, 4, 5]:
        return 'Spring'
    
    elif monthNumber in [6, 7, 8]:
        return 'Summer'
    
    elif monthNumber in [9, 10, 11]:
        return 'Autumn'
    
df['Season'] = df['Month'].apply(find_seasons)
df['Season']
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.month_name()
In [96]:
max_sales_by_Season = df.groupby('Season')['Total Sales'].max()
# Print the maximum sales for each season
print(max_sales_by_Season)
Season
Autumn    53700.0
Spring    53125.0
Summer    60860.0
Winter    60860.0
Name: Total Sales, dtype: float64
In [97]:
total_sales_by_Season = df.groupby('Season')['Total Sales'].sum()

# Print the total sales for each season
print(total_sales_by_Season)
Season
Autumn    2.725748e+07
Spring    2.705445e+07
Summer    3.330763e+07
Winter    2.838690e+07
Name: Total Sales, dtype: float64
In [98]:
max_Season = total_sales_by_Season.idxmax()
# Print the season with the highest total sales
print(max_Season)
Summer
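
The month-to-season lookup used for this breakdown can also be written as a dictionary passed to `Series.map`, which avoids a Python-level function call per row. This is a sketch on a few made-up dates; `SEASON_BY_MONTH` is an illustrative name.

```python
import pandas as pd

# Same season boundaries as find_seasons, expressed as a lookup table.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

# Hypothetical invoice dates.
dates = pd.Series(pd.to_datetime(["2021-06-17", "2021-12-31", "2020-01-01", "2021-10-05"]))

# Extract the month number and look up its season.
seasons = dates.dt.month.map(SEASON_BY_MONTH)

print(seasons.tolist())
```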
In [99]:
def groupData(columnName):
    return df.groupby(columnName).agg({'Total Sales' : 'sum', 'Operating Profit' : 'sum'})
In [100]:
SeasonSales = groupData('Season').sort_values(by = 'Total Sales', ascending = False)

# set size to plot
plt.figure(figsize = (15,6))

# create plot of Total Sales 
plt.subplot(1, 2, 1)
sns.lineplot(x = SeasonSales.index, y = "Total Sales", data = SeasonSales, marker = "o")


# Create plot of Operating Profit
plt.subplot(1, 2, 2)
sns.lineplot(x = SeasonSales.index, y = "Operating Profit", data = SeasonSales, marker='o')

plt.show()
[Figure: line plots of Total Sales and Operating Profit by season]

We found that sales are highest in summer: about 33.3 million, compared with 27.1 to 28.4 million in the other seasons.

In [101]:
max_sales_by_salesmethod = df.groupby('Sales Method')['Total Sales'].max()
# Print the maximum sales for each sales method
print(max_sales_by_salesmethod)
Sales Method
In-store    60860.0
Online      60860.0
Outlet      60860.0
Name: Total Sales, dtype: float64
In [102]:
total_sales_by_salesmethod = df.groupby('Sales Method')['Total Sales'].sum()

# Print the total sales for each sales method
print(total_sales_by_salesmethod)
Sales Method
In-store    3.450676e+07
Online      4.407039e+07
Outlet      3.742931e+07
Name: Total Sales, dtype: float64
In [103]:
max_salesmethod = total_sales_by_salesmethod.idxmax()
# Print the sales method with the highest total sales
print(max_salesmethod)
Online
In [104]:
SalesMethod = df.groupby('Sales Method')['Total Sales'].sum().sort_values(ascending = False)

# set size to plot
plt.figure(figsize = (8,4))

# create plot of Total Sales 
sns.lineplot(x = SalesMethod.index, y = SalesMethod.values, data = SalesMethod, marker = "o")

plt.show()
[Figure: line plot of Total Sales by sales method]

We found that online sales lead: about 44.1 million, compared with 37.4 million for outlets and 34.5 million in-store.

In [105]:
df['Year'] = df['Invoice Date'].dt.year
df['Year']
Out[105]:
0       2021
1       2021
2       2021
3       2021
4       2021
        ... 
9636    2021
9637    2021
9638    2021
9639    2021
9640    2021
Name: Year, Length: 9637, dtype: int32
In [106]:
sales_by_month = df.groupby(['Year','Month'])['Total Sales'].sum().reset_index()

# create plot
plt.figure(figsize = (12,6))
sns.lineplot(x = "Month", y = "Total Sales", hue = "Year", data = sales_by_month, marker='o')

plt.show()
[Figure: line plot of monthly Total Sales, one line per year]

We found a significant increase in sales in 2021. This may be related to the COVID-19 pandemic, although the data alone does not establish the cause.
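
Since the Month column holds month names, grouping on it sorts the months alphabetically. One way to keep calendar order, sketched here on a made-up frame, is to turn the column into an ordered categorical first. The snippet assumes the default English locale for `calendar.month_name`.

```python
import calendar
import pandas as pd

# January ... December in calendar order (index 0 of month_name is an empty string).
month_order = list(calendar.month_name)[1:]

# Hypothetical monthly records, deliberately out of order.
toy = pd.DataFrame({
    "Month": ["March", "January", "February", "January"],
    "Total Sales": [30.0, 10.0, 20.0, 5.0],
})

# An ordered categorical sorts by calendar position instead of alphabetically.
toy["Month"] = pd.Categorical(toy["Month"], categories=month_order, ordered=True)

# observed=True keeps only the months that actually occur in the data.
monthly = toy.groupby("Month", observed=True)["Total Sales"].sum()

print(monthly)
```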

In [107]:
cluster_counts = np.zeros(k, dtype=int)
for i in range(len(labels)):
    cluster_counts[labels[i]] += 1
In [108]:
for j in range(k):
    print("Cluster ", j, " Count: ", cluster_counts[j],"points")
Cluster  0  Count:  3056 points
Cluster  1  Count:  1508 points
Cluster  2  Count:  1134 points
Cluster  3  Count:  3939 points
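
The counting loop can be replaced by `np.bincount`, which tallies non-negative integer labels in one call. The labels below are made up for illustration.

```python
import numpy as np

# Hypothetical cluster labels for seven points in four clusters.
toy_labels = np.array([1, 0, 0, 3, 2, 0, 3])

# counts[j] is the number of points assigned to cluster j.
counts = np.bincount(toy_labels, minlength=4)

# Share of all points that falls in each cluster.
shares = counts / counts.sum()

for j, (count, share) in enumerate(zip(counts, shares)):
    print(f"Cluster {j}: {count} points ({share:.1%})")
```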
In [112]:
from sklearn.metrics import pairwise_distances
for j in range(k):
    cluster_points = data_a[labels == j]  
    distances = pairwise_distances(cluster_points, metric='euclidean')  
    avg_distance = np.mean(distances) 
    print("Cluster", j, "Average Distance:", avg_distance)
Cluster 0 Average Distance: 3418.5170503475556
Cluster 1 Average Distance: 5701.029289463491
Cluster 2 Average Distance: 9111.804274489772
Cluster 3 Average Distance: 1869.1340108374193

Cluster 2 (excellent sales performance) is not tightly grouped, as its average pairwise distance is high.

Cluster 3 (low sales performance) is tightly grouped, as its average pairwise distance is low.

In [113]:
clustered_data = pd.DataFrame(data, columns=selected_columns)
clustered_data['Cluster'] = labels


cluster_stats = clustered_data.groupby('Cluster').agg(['mean', 'median', 'std', 'min', 'max', 'count'])


print(cluster_stats)
          Total Sales                                                \
                 mean   median          std      min      max count   
Cluster                                                               
0         9535.300466   9076.5  2348.673728   6125.0  16065.0  3056   
1        19979.690981  19500.0  3914.688581  12500.0  28125.0  1508   
2        37825.542328  35800.0  7834.021913  27000.0  60860.0  1134   
3         3514.355166   3500.0  1533.750641    160.0   6256.0  3939   

        Operating Profit                                                  \
                    mean    median          std      min       max count   
Cluster                                                                    
0            1420.158467    513.12  1535.105625   103.04   7393.75  3056   
1            6973.750749   7000.00  2548.591213   489.60  12801.25  1508   
2           11673.026808  12565.80  1690.530849  4462.50  12888.00  1134   
3             279.344034    162.06   427.941554     8.00   3000.00  3939   

        Price per Unit                                      
                  mean median        std   min   max count  
Cluster                                                     
0            47.758603   47.0  12.754951  10.0  85.0  3056  
1            49.410477   50.0  12.419693  20.0  85.0  1508  
2            60.568783   60.0  11.835446  40.0  85.0  1134  
3            37.045697   37.0  11.859960   7.0  73.0  3939  
In [114]:
from sklearn.metrics import pairwise_distances


centroids = kmedoids.cluster_centers_

centroid_distances = pairwise_distances(centroids)

cluster_distances = []
for cluster in np.unique(labels):
    cluster_indices = np.where(labels == cluster)[0]
    cluster_data = data_a[cluster_indices]
    cluster_distances.append(np.mean(pairwise_distances(cluster_data)))

davies_bouldin_scores = []
for i in range(k):
    db_score = 0
    for j in range(k):
        if i != j:
            db_score += (cluster_distances[i] + cluster_distances[j]) / centroid_distances[i, j]
    db_score /= (k - 1)
    davies_bouldin_scores.append(db_score)

avg_db_index = np.mean(davies_bouldin_scores)
print(f"\nAverage Davies-Bouldin Index across clusters: {avg_db_index:.4f}")
Average Davies-Bouldin Index across clusters: 0.6265

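A lower Davies-Bouldin Index indicates better clustering. Note that the loop above averages the ratios over all other clusters and uses mean pairwise distances, whereas the standard index takes the worst (maximum) ratio per cluster and uses distances to the centroid, so the value printed here is not directly comparable to the standard index.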
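
For comparison, the standard Davies-Bouldin definition can be sketched in plain NumPy: each cluster's scatter is the mean distance of its points to the centroid, and each cluster contributes its worst ratio against any other cluster. The function name `davies_bouldin` and the toy points are assumptions; scikit-learn also ships `sklearn.metrics.davies_bouldin_score` for this purpose.

```python
import numpy as np

def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index: mean over clusters of the worst similarity ratio."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([points[labels == c].mean(axis=0) for c in ids])
    # Scatter s_i: mean distance from each member of cluster i to its centroid.
    scatter = np.array([
        np.linalg.norm(points[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    worst_ratios = []
    for i in range(len(ids)):
        # (s_i + s_j) / d_ij for every other cluster j; keep the largest.
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ids)) if j != i
        ]
        worst_ratios.append(max(ratios))
    return float(np.mean(worst_ratios))

# Hypothetical data: two clusters with scatter 1 each, centroids 10 apart.
toy_points = [[0, 0], [0, 2], [10, 0], [10, 2]]
toy_labels = [0, 0, 1, 1]
print(davies_bouldin(toy_points, toy_labels))
```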

In [115]:
from sklearn.metrics import silhouette_samples
import numpy as np

# Calculate Silhouette Coefficient for each sample (data point)
sample_silhouette_values = silhouette_samples(data, kmedoids.labels_)

# Assign Silhouette Coefficient values to each data point in the DataFrame
clustered_data['Silhouette Coefficient'] = sample_silhouette_values

# Print the Silhouette Coefficient for each cluster
print("Silhouette Coefficient for each cluster:")
for cluster in np.unique(kmedoids.labels_):
    cluster_indices = np.where(kmedoids.labels_ == cluster)[0]
    silhouette_cluster = np.mean(sample_silhouette_values[cluster_indices])
    print(f"Cluster {cluster}: Silhouette Coefficient = {silhouette_cluster:.4f}")
Silhouette Coefficient for each cluster:
Cluster 0: Silhouette Coefficient = 0.3784
Cluster 1: Silhouette Coefficient = 0.4322
Cluster 2: Silhouette Coefficient = 0.4574
Cluster 3: Silhouette Coefficient = 0.6804

A higher silhouette value indicates better clustering, so cluster 3 is better separated than the other clusters.

In [116]:
# hierarchical silhouette_score
from sklearn.metrics import silhouette_score
silhouette_avg = silhouette_score(X, clusters)
print(f"Silhouette Score: {silhouette_avg:.4f}")
Silhouette Score: 0.7092
In [117]:
#k-medoids silhouette_score
silhouette_avg = silhouette_score(data_a, labels)
print("The average silhouette_score is :", silhouette_avg)
The average silhouette_score is : 0.5195491103947028

The silhouette score ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
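
For reference, the per-sample silhouette value follows the standard definition below, where a(i) is the mean distance from point i to the other points in its own cluster and b(i) is the smallest mean distance from point i to the points of any other cluster. The second expression is a worked case with made-up values a(i) = 2 and b(i) = 5.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
s(i) = \frac{5 - 2}{\max\{2,\, 5\}} = 0.6
```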

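By this score, hierarchical clustering looks better in our case (0.7092 vs. 0.5195), although the two runs use different feature sets (two columns vs. three) and different numbers of clusters (8 vs. 4), so the comparison is only indicative.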

In [118]:
from sklearn.metrics import calinski_harabasz_score

# Calculate Calinski-Harabasz Index for k-medoids
calinski_score = calinski_harabasz_score(data_a, labels)
print("The Calinski-Harabasz Index is:", calinski_score)
The Calinski-Harabasz Index is: 29891.74369942133
In [119]:
# for hierarchical
calinski_score = calinski_harabasz_score(X, clusters)
print("The Calinski-Harabasz Index is:", calinski_score)
The Calinski-Harabasz Index is: 97153242683.56349

The Calinski-Harabasz Index is the ratio of the variance between clusters to the variance within clusters. A higher index signifies better separation between clusters.
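
Written out, with n samples, k clusters, B_k the between-cluster dispersion matrix, and W_k the within-cluster dispersion matrix, the usual definition is:

```latex
\mathrm{CH} = \frac{\operatorname{tr}(B_k) / (k - 1)}{\operatorname{tr}(W_k) / (n - k)}
```

Because the value depends on k and on the features used, scores from runs with different feature sets or cluster counts should be compared with care.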

We found that hierarchical clustering has a much higher score, which suggests much better separation between clusters.

In conclusion, hierarchical clustering shows better cluster quality on these metrics.

Kernel density estimation (KDE) is used to estimate the density distribution of each cluster and identify regions of overlap.

In [120]:
import numpy as np
from sklearn.neighbors import KernelDensity

def calculate_cluster_densities(data, labels):
    cluster_densities = []
    unique_labels = np.unique(labels)
    
    for label in unique_labels:
        # Extract data points belonging to the current cluster
        cluster_data = data[labels == label]
        
        # Fit KDE model for the current cluster
        kde = KernelDensity(bandwidth=0.5, kernel='gaussian')
        kde.fit(cluster_data)
        
        # Evaluate KDE at each data point
        densities = np.exp(kde.score_samples(cluster_data))
        
        # Store the density estimates for the current cluster
        cluster_densities.append(densities)
    
    return cluster_densities

# Example usage
# Assuming 'data' is your data array and 'labels' are the cluster labels
cluster_densities = calculate_cluster_densities(data_a, labels)

# Compute average density for each cluster
avg_cluster_densities = [np.mean(densities) for densities in cluster_densities]

# Print or visualize the average densities to identify regions of overlap
print("Average densities for each cluster:", avg_cluster_densities)
Average densities for each cluster: [0.0004301624048650191, 0.0014603683686582874, 0.008904818071545388, 0.00020793674683601603]

Cluster 2 (the third value in the list) has the highest average density, indicating that many of its data points sit very close together in a high-density region. Because the bandwidth of 0.5 is applied to unscaled values, these densities mostly reflect exact or near-duplicate points, so this reading should be treated with caution.

Clusters 0, 1, and 3 have lower average densities than Cluster 2, indicating that the data points within these clusters are less densely packed. This could imply that these clusters overlap more with neighboring clusters or are not as well separated.
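
A single KDE bandwidth is easier to interpret when every feature is on a comparable scale. One option, shown here as a sketch rather than as what the notebook does, is to standardize each column to zero mean and unit variance before fitting the density model. The sample rows are hypothetical.

```python
import numpy as np

# Hypothetical rows of (Total Sales, Operating Profit, Price per Unit).
toy = np.array([
    [9000.0, 700.0, 63.0],
    [19500.0, 6800.0, 60.0],
    [3500.0, 200.0, 34.0],
    [35800.0, 11800.0, 50.0],
])

# Column-wise z-score: subtract the mean and divide by the standard deviation.
scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled.round(2))
```

After scaling, a bandwidth such as 0.5 means half a standard deviation in every feature instead of half a dollar in one and half a unit in another.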


West-region retailers have better sales performance.

The summer season has clearly better sales performance, so production could be increased ahead of summer to raise sales.

Online sales outperform the other methods, so we can spend more on advertising our website to increase sales and spread our products globally.

Most sales records fall in either the mid or the low sales-performance cluster (clusters 0 and 3, 6,995 of 9,637 records), which indicates that sales need improvement, so we want to grow the upper-mid and excellent segments.

All of these patterns can help business owners make decisions to increase sales performance.